ci(kai): add KaiBench evaluation workflow to CI pipeline [AI-2588] #399

Open

jordanrburger wants to merge 8 commits into main from AI-2588-kaibench-ci-evals

Conversation

@jordanrburger
Contributor

Description

Linear: AI-2588

Change Type

  • Major (breaking changes, significant new features)
  • Minor (new features, enhancements, backward compatible)
  • Patch (bug fixes, small improvements, no new features)

Summary

Adds automated KaiBench evaluations to the CI pipeline so MCP server changes are tested against the full AI agent stack before merging. This catches regressions where tool description changes, argument schema modifications, or behavior changes break the agent's ability to answer questions correctly.

How it works:

  • New kaibench.yml reusable workflow builds the MCP server Docker image from the PR branch, starts the full stack (MCP server + kai-assistant + Postgres + Redis), clones KaiBench, and runs evaluations against 3 question types (Data Analysis Query, Configuration Reasoning, Storage Object Reasoning)
  • ci.yml calls this workflow as a non-blocking job after the build passes (same-repo pushes only; see the sketch after this list)
  • Results appear as a detailed GitHub step summary with per-question pass/fail, tool call counts, token usage, and duration
  • Artifacts (summary.json + results.jsonl) uploaded with 90-day retention
  • Also supports workflow_dispatch for manual runs with configurable question types
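
For orientation, a minimal sketch of how ci.yml can wire in the reusable workflow as a non-blocking follow-up job. The build job contents, workflow paths, and the exact same-repo guard are assumptions, not the final implementation:

```yaml
# .github/workflows/ci.yml (excerpt) -- illustrative sketch only
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: echo "existing build steps"   # placeholder for the real build job

  kaibench:
    # Runs after the build and only for the canonical repo (forks lack the secrets).
    # Non-blocking: the job is not configured as a required status check for merging.
    needs: build
    if: github.repository == 'keboola/mcp-server'
    uses: ./.github/workflows/kaibench.yml
    secrets: inherit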

Before merging — secrets required:

| Secret | Purpose |
| --- | --- |
| KAIBENCH_REPO_TOKEN | GitHub PAT to clone keboola-rnd/KaiBench |
| KAIBENCH_STATIC_TOKEN | Storage API token for canary-orion project 293 |
| KAIBENCH_MANAGEMENT_TOKEN | Management API token |
| KAIBENCH_API_URL | https://connection.canary-orion.keboola.dev |
| DOCKERHUB_TOKEN | Docker Hub (for kai-assistant image) |
| KAI_GOOGLE_VERTEX_CREDENTIALS | Google Vertex AI service account JSON |
| KAI_GOOGLE_VERTEX_PROJECT | Vertex project ID |
| KAI_GOOGLE_VERTEX_LOCATION | Vertex location |
| TURBO_TOKEN | Turbo monorepo cache (for building kai-assistant from source) |
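
A hedged sketch of how kaibench.yml can declare these secrets on its workflow_call trigger so the caller can pass them through (the secret names mirror the table above; which ones end up required vs. optional is an assumption):

```yaml
# .github/workflows/kaibench.yml (excerpt) -- declaration sketch only
on:
  workflow_call:
    secrets:
      KAIBENCH_REPO_TOKEN:
        required: true
      KAIBENCH_STATIC_TOKEN:
        required: true
      KAIBENCH_MANAGEMENT_TOKEN:
        required: true
      KAIBENCH_API_URL:
        required: true
      DOCKERHUB_TOKEN:
        required: true
      KAI_GOOGLE_VERTEX_CREDENTIALS:
        required: true
      KAI_GOOGLE_VERTEX_PROJECT:
        required: true
      KAI_GOOGLE_VERTEX_LOCATION:
        required: true
      TURBO_TOKEN:
        required: false   # only needed when building kai-assistant from source
```

With the declarations in place, ci.yml can forward everything via `secrets: inherit` (as in the sketch above) or pass each secret explicitly.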

Testing

N/A for manual client testing (e.g. Cursor AI desktop over the Streamable-HTTP transport); this is a CI-only change (GitHub Actions workflows). Testing plan:

  • Add required secrets to repo settings
  • Trigger kaibench.yml via workflow_dispatch to validate that the full stack starts and the evals run (see the sketch after this list)
  • Verify GitHub Actions step summary renders the results table
  • Download artifacts and verify summary.json + results.jsonl are present
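
To make the manual trigger concrete, a sketch of the workflow_dispatch block with a configurable question-type filter; the input name and default value are assumptions, not the final interface:

```yaml
# .github/workflows/kaibench.yml (excerpt) -- manual trigger sketch
on:
  workflow_dispatch:
    inputs:
      question_types:
        description: Comma-separated KaiBench question types to run
        required: false
        default: "Data Analysis Query,Configuration Reasoning,Storage Object Reasoning"
        type: string
```

With something like this in place, a run can be started from the Actions UI or with `gh workflow run kaibench.yml -f question_types="Data Analysis Query"` (input name assumed as above).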

Checklist

  • Self-review completed
  • Unit tests added/updated (if applicable) — N/A, CI workflow only
  • Integration tests added/updated (if applicable) — N/A, CI workflow only
  • Project version bumped according to the change type (if applicable) — N/A
  • Documentation updated (if applicable)

🤖 Generated with Claude Code

Add kaibench.yml reusable workflow that builds MCP server from the PR
branch, starts the full stack (MCP server + kai-assistant + Postgres +
Redis), and runs KaiBench evaluations. Triggered as a non-blocking job
in ci.yml after the build passes (same-repo pushes only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@linear

linear bot commented Feb 21, 2026

AI-2588 Evals in CI/CD

jordanrburger and others added 7 commits February 21, 2026 14:35
Restricts GITHUB_TOKEN to read-only contents access to satisfy CodeQL
security policy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
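
The read-only restriction corresponds to a small permissions block at the workflow (or job) level; this is standard GitHub Actions syntax:

```yaml
# Restrict the default GITHUB_TOKEN for this workflow to read-only repo contents.
permissions:
  contents: read
```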
cli.py imports requests (for requests.JSONDecodeError handler) but it
was not declared in pyproject.toml. This caused test collection to fail
on Python 3.11 in CI where no transitive dependency pulls it in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The repo was renamed from keboola/keboola-mcp-server to keboola/mcp-server, causing the workflow's if condition on the repository name to evaluate to false.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
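
For illustration, the kind of guard this fixes; the exact expression is an assumption based on the rename described above:

```yaml
# Before (stale repo name -- never matched after the rename):
if: github.repository == 'keboola/keboola-mcp-server'
# After:
if: github.repository == 'keboola/mcp-server'
```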
Always use pre-built kai-assistant Docker image tag instead of building
from the UI repo source. This removes the need for UI repo access and
TURBO_TOKEN. The image tag is provided via vars.KAI_ASSISTANT_IMAGE_TAG.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of requiring a static image tag, the workflow now queries
Docker Hub for the latest production-kai-assi-* tag when none is
provided. This ensures CI always tests against the newest kai-assistant
without manual variable updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
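
A sketch of how that tag resolution could look as a workflow step. The Docker Hub repository name (keboola/kai-assistant), the variable name, and the tag prefix filter are assumptions; only the Docker Hub tags endpoint and the GITHUB_OUTPUT mechanism are standard:

```yaml
- name: Resolve latest kai-assistant image tag
  id: kai_tag
  run: |
    # Use the explicit tag when provided, otherwise ask Docker Hub for the
    # most recently updated production-kai-assi-* tag.
    # -S keeps curl error output visible even though -s suppresses progress.
    TAG='${{ vars.KAI_ASSISTANT_IMAGE_TAG }}'
    if [ -z "$TAG" ]; then
      TAG=$(curl -fsS \
        "https://hub.docker.com/v2/repositories/keboola/kai-assistant/tags?page_size=100&ordering=last_updated" \
        | jq -r '.results[].name' \
        | grep '^production-kai-assi-' \
        | head -n 1)
    fi
    echo "Resolved kai-assistant tag: $TAG"
    echo "tag=$TAG" >> "$GITHUB_OUTPUT"
```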
Remove silent curl flags to surface errors when Docker Hub API
calls fail during kai-assistant tag resolution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of running the full evaluation locally (which needs UI repo
access, TURBO_TOKEN, and registry credentials), dispatch the eval
to the KaiBench repo where all secrets are centralized.

Results are posted back as a commit status on the MCP server repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
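
A sketch of what the dispatch step could look like; the event type and client_payload fields are assumptions, while the repository dispatches endpoint and `gh api` field flags are standard:

```yaml
- name: Dispatch eval to KaiBench
  env:
    GH_TOKEN: ${{ secrets.KAIBENCH_REPO_TOKEN }}
  run: |
    # Fire a repository_dispatch event in keboola-rnd/KaiBench carrying the
    # commit to evaluate; KaiBench posts the result back as a commit status.
    gh api repos/keboola-rnd/KaiBench/dispatches \
      -f event_type=mcp-server-eval \
      -f "client_payload[repository]=${{ github.repository }}" \
      -f "client_payload[sha]=${{ github.sha }}"
```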